Evaluation of Safety-Critical Software

Methods and approaches for testing the reliability and trustworthiness of 
software remain among the most controversial issues facing this age of high 
technology. The authors present some of the crucial questions faced by 
software programmers and eventual users. 
It is increasingly common to use programmable computers
in applications where their failure could be life-threatening
and could result in extensive damage. For
example, computers now have safety-critical functions 
in both military and civilian aircraft, in nuclear plants, 
and in medical devices. It is incumbent upon those 
responsible for programming, purchasing, installing, 
and licensing these systems to determine whether or 
not the software is ready to be used. This article ad- 
dresses questions that are simple to pose but hard to 
answer. What standards must a software product satisfy 
if it is to be used in safety-critical applications such as 
those mentioned? What documentation should be required?
How much testing is needed? How should the
software be structured? 

This article differs from others concerned with soft- 
ware in safety-critical applications, in that it does not 
attempt to identify safety as a property separate from
reliability and trustworthiness. In other words, we do 
not attempt to separate safety-critical code from other 
code in a product used in a safety-critical application. 
In our experience, software exhibits weak-link behavior, 
that is, failures in even the unimportant parts of the
code can have unexpected repercussions elsewhere. 
For a discussion of another viewpoint, we suggest the 
work of N. G. Leveson [6, 7, 8]. 

We favor keeping safety-critical software as small and 
simple as possible by moving any functions that are not 
safety critical to other computers. This further justifies 
our assumption that all parts of a safety-critical soft- 
ware product must be considered safety critical. 

WHY IS SOFTWARE A SPECIAL CONCERN? 
Within the engineering community software systems 
have a reputation for being undependable, especially in 
the first years of their use. The public is aware of a few 
spectacular stories such as the Space Shuttle flight that 
was delayed by a software timing problem, or the Venus
probe that was lost because of a punctuation error.
In the software community, the problem is known to be 
much more widespread. 

A few years ago, David Benson, professor of Com- 
puter Science at Washington State University, issued a 
challenge by way of several electronic bulletin board 
systems. He asked for an example of a real-time system 
that functioned adequately when used for the first time 
by people other than its developers for a purpose other 
than testing. Only one candidate for this honor was 
proposed, but even that candidate was controversial. It 
consisted of approximately 18,000 instructions, most of 
which had been used for several years before the "first 
use." The only code that had not been used before that 
first use was a simple sequence of 200 instructions that 
simulated a simple analogue servomechanism. That in- 
struction sequence had been tested extensively against 
an analogue model. All who have looked at this pro- 
gram regard it as exceptional. If we choose to regard 
this small program as one that worked in its first real 
application, it is the proverbial "exception that proves 
the rule." 

As a rule, software systems do not work well until
they have been used, and have failed repeatedly, in 
real applications. Generally, many uses and many 
failures are required before a product is considered 
reliable. Software products, including those that have 
become relatively reliable, behave like other products 
of evolution-like processes; they often fail, even years 
after they were built, when the operating conditions 
change. 

While there are errors in many engineering products, 
experience has shown that errors are more common, 
more pervasive, and more troublesome, in software 
than in other technologies. This information must be 
understood in light of the fact that it is now standard
practice among software professionals to have their product
go through an extensive series of carefully planned 
tests before real use. The products fail in their first real 
use because the situations that were not anticipated by 
the programmers were also overlooked by the test plan- 
ners. Most major computer-using organizations, both 
military and civilian, are investing heavily in searching
for ways to improve the state of the art in software. 
The problem remains serious and there is no sign of a 
"silver bullet." The most promising development is the 
work of Harlan Mills and his colleagues at IBM on a 
software development process known as "clean room" 
[3, 9, 12]. Mills uses randomly selected tests, carried out 
by an independent testing group. The use of randomly 
generated test data reduces the likelihood of shared 
oversights. We will discuss this approach in more detail 
later in this article. 

WHY IS SOFTWARE USED? 

If software is so untrustworthy, one might ask why en- 
gineers do not avoid it by continuing to use hard-wired 
digital and analogue hardware. Here, we list the three 
main advantages of replacing hardware with software: 

1. Software technology makes it practical to build more 
logic into the system. Software-controlled computer 
systems can distinguish a large number of situations 
and provide output appropriate to each of them. 
Hard-wired systems could not obtain such behavior 
without prohibitive amounts of hardware. Program- 
mable hardware is less expensive than the equiva- 
lent hard-wired logic because it is regular in struc- 
ture and it is mass produced. The economic aspects 
of the situation also allow software-controlled sys- 
tems to perform more checking; reliability can be 
increased by periodic execution of programs that 
check the hardware. 

2. Logic implemented in software is, in theory, easier 
to change than logic implemented in hardware. 
Many changes can be made without adding new 
components. When a system is replicated or located 
in a physical position that is hard to reach, it is far 
easier to make changes in software than in hard- 
ware. 

3. Computer technology and software flexibility make 
it possible to provide more information to operators 
and to provide that information in a more useful 
form. The operator of a modern software-controlled 
system can be provided with information that would 
be unthinkable in a pure hardware system. All of 
this can be achieved using less space and power 
than was used by noncomputerized systems. 

These factors explain the replacement of hard-wired 
systems with software-controlled systems in spite of 
software's reputation as an unreliable technology. 

HOW ARE SOFTWARE CONTROLLERS 
LIKE OTHER CONTROLLERS? 
In the next section we will argue that software tech- 
nology requires some refinements in policies and stan- 
dards because of differences between software and 
hardware technology. However, it is important to rec- 
ognize some common properties of software and hard- 
ware control systems. 
In the design and specification of control systems,
engineers have long known how to use a black box
mathematical model of the controller. In such models, 
(1) the inputs to the controller are described as mathe- 
matical functions of certain observable environmental 
state variables, (2) the outputs of the controller are de- 
scribed as mathematical functions of the inputs, (3) the 
values of the controlled environmental variables are 
described as mathematical functions of the controller's 
outputs, and (4) the required relation between the con- 
trolled variables and observed variables is described. It 
is then possible to confirm that the behavior of the 
controller meets its requirements. 

It is important to recognize that, in theory, software- 
implemented controllers can be described in exactly 
the same way as black box mathematical models. They 
can also be viewed as black boxes whose output is a 
mathematical function of the input. In practice, they 
are not viewed this way. One reason for the distinction 
is that their functions are more complex (i.e. harder to 
describe) than the functions that describe the behavior 
of conventional controllers. However, [4] and [17] pro- 
vide ample evidence that requirements for real systems 
can be documented in this way. We return to this 
theme later. 

HOW IS SOFTWARE DIFFERENT FROM 
OTHER CONTROLLER TECHNOLOGIES? 

Software problems are often considered growing pains 
and ascribed to the adolescent nature of the field. Un- 
fortunately there are fundamental differences between 
software and other approaches that suggest these prob- 
lems are here to stay. 

Complexity: The most immediately obvious difference 
between software and hardware technologies is their 
complexity. This can be observed by considering the 
size of the most compact descriptions of the software. 
Precise documentation, in a reasonably general nota- 
tion, for small software systems can fill a bookcase. 
Another measure of complexity is the time it takes for 
a programmer to become closely familiar with a system. 
Even with small software systems, it is common to find 
that a programmer requires a year of working with the 
program before he/she can be trusted to make im- 
provements on his/her own. 

Error Sensitivity: Another notable property of software 
is its sensitivity to small errors. In conventional engi- 
neering, every design and manufacturing dimension 
can be characterized by a tolerance. One is not re- 
quired to get things exactly right; being within the 
specified tolerance of the right value is good enough. 
The use of a tolerance is justified by the assumption 
that small errors have small consequences. It is well 
known that in software, trivial clerical errors can have 
major consequences. No useful interpretation of toler- 
ance is known for software. A single punctuation error 
can be disastrous, even though fundamental oversights 
sometimes have negligible effects. 
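
A minimal sketch in Python illustrates this sensitivity. The
command tables and the helper is_accepted are hypothetical
names chosen for illustration; they are not taken from any
system discussed in this article. The two tables differ by a
single comma, yet the second silently loses a command.

```python
# Hypothetical command tables for a valve controller (illustrative names only).
VALID_COMMANDS = ["open", "close", "vent"]

# The same table with one comma missing. Python joins the adjacent string
# literals "close" and "vent" into the single entry "closevent".
TYPO_COMMANDS = ["open", "close" "vent"]


def is_accepted(command, table):
    """Return True if the command appears in the given command table."""
    return command in table


if __name__ == "__main__":
    print(len(VALID_COMMANDS), len(TYPO_COMMANDS))
    print(is_accepted("vent", VALID_COMMANDS), is_accepted("vent", TYPO_COMMANDS))
```

No compiler or interpreter diagnostic is produced for the
second table; the error surfaces only when "vent" or "close"
is requested.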

Hard to Test: Software is notoriously difficult to test
adequately. It is common to find that a piece of software
that has been subjected to a thorough and disciplined
testing regime has serious flaws. Testing of analogue
devices is based on interpolation. One assumes that de- 
vices that function well at two close points will func- 
tion well at points in-between. In software that assump- 
tion is not valid. The number of cases that must be 
tested in order to engender confidence in a piece 
of software is usually extremely large. Moreover, as 
Harlan Mills has pointed out, "testing carried out by 
selected test cases, no matter how carefully and well- 
planned, can provide nothing but anecdotes" [3, 9, 12]. 

These properties are fundamental consequences of 
the fact that the mathematical functions implemented 
by software are not continuous functions, but functions 
with an arbitrary number of discontinuities. The lack of 
continuity constraints on the functions describing pro- 
gram effects makes it difficult to find compact descrip- 
tions of the software. The lack of such constraints gives 
software its flexibility, but it also allows the complex- 
ity. Similarly, the sensitivity to small errors, and the 
testing difficulties, can be traced to fundamental math- 
ematical properties; we are unlikely to discover a mira- 
cle cure. Great discipline and careful scrutiny will al- 
ways be required for safety-critical software systems. 
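
The failure of interpolation can be shown with a small
sketch in Python. The function scale_reading and its flaw at
the value 128 are invented for illustration. Tests at two
neighbouring points succeed, yet the point between them
fails, which could not happen for a continuous function
tested to within a tolerance.

```python
def scale_reading(raw):
    """Intended behavior: return raw * 2 for every raw value in 0..255."""
    if raw == 128:  # hypothetical leftover special case (the flaw)
        return 0
    return raw * 2


def passes(raw):
    """Compare the implementation against the intended function at one point."""
    return scale_reading(raw) == raw * 2


if __name__ == "__main__":
    # The two outer points pass; the point in between does not.
    print(passes(127), passes(128), passes(129))
```

Only an exhaustive sweep of all 256 inputs, or a test that
happens to select 128, would reveal the flaw.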

Correlated Failures: Many of the assumptions normally 
made in the design of high-reliability hardware are in- 
valid for software. Designers of high-reliability hard- 
ware are concerned with manufacturing failures and 
wear-out phenomena. They can perform their analysis 
on the assumption that failures are not strongly corre- 
lated and simultaneous failures are unlikely. Those 
who evaluate the reliability of hardware systems 
should be, and often are, concerned about design errors 
and correlated failures; however in many situations the 
effects of other types of errors are dominant. 

In software there are few errors introduced in the 
manufacturing (compiling) phase; when there are such 
errors they are systematic, not random. Software does 
not wear out. The errors with which software reliabil- 
ity experts must be concerned are design errors. These 
errors cannot be considered statistically independent. 
There is ample evidence that, even when programs for 
a given task are written by people who do not know of 
each other, they have closely related errors [6, 7, 8]. 

In contrast to the situation with hardware systems, 
one cannot obtain higher reliability by duplication of 
software components. One simply duplicates the errors. 
Even when programs are written independently, the 
oversights made by one programmer are often shared 
by others. As a result, one cannot count on increasing 
the reliability of software systems simply by having 
three computers where one would be sufficient 
[6, 7, 8].
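
The effect of shared oversights on a triplicated system can
be sketched as follows. The failure sets below are invented
for illustration and are not measured data; the sketch
assumes a two-out-of-three majority voter and describes each
version only by the set of inputs on which it gives a wrong
answer.

```python
def voter_fails(failure_sets, x):
    """A 2-out-of-3 voter is wrong on input x when at least two versions are wrong."""
    return sum(1 for s in failure_sets if x in s) >= 2


def voter_failure_count(failure_sets, inputs):
    """Count the inputs on which the voted result is wrong."""
    return sum(1 for x in inputs if voter_fails(failure_sets, x))


INPUTS = range(1000)

# Assumption 1: each version fails on its own, disjoint inputs.
UNCORRELATED = [{10, 11}, {500, 501}, {900, 901}]

# Assumption 2: all three versions share one oversight (inputs 10 and 11).
CORRELATED = [{10, 11}, {10, 11, 500}, {10, 11, 900}]

if __name__ == "__main__":
    print(voter_failure_count(UNCORRELATED, INPUTS))
    print(voter_failure_count(CORRELATED, INPUTS))
```

Under the first assumption the voter masks every individual
failure. Under the second, the voter is wrong on exactly the
inputs that all versions mishandle, so triplication gives no
protection against the shared oversight.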

Lack of Professional Standards: A severe problem in the 
software field is that, strictly speaking, there are no 
software engineers. In contrast to older engineering 
fields, there is no accrediting agency for professional
software engineers. Those in software engineering have
not agreed on a set of skills and knowledge that should 
be possessed by every software engineer. Anyone with 
a modicum of programming knowledge can be called a 
software engineer. Often, critical programming systems 
are built by people with no postsecondary training 
about software. Although they may have useful knowl- 
edge of the field in which the software will be applied, 
such knowledge is not a substitute for understanding 
the foundations of software technology. 

SOFTWARE TESTING CONCERNS 

Some engineers believe one can design black box tests 
without knowledge of what is inside the box. This is, 
unfortunately, not completely true. If we know that the 
contents of a black box exhibit linear behavior, the 
number of tests needed to make sure it would function 
as specified could be quite small. If we know that the 
function can be described by a polynomial of order "N,"
we can use that information to determine how many 
tests are needed. If the function can have a large num- 
ber of discontinuities, far more tests are needed. That is 
why a shift from analogue technology to software 
brings with it a need for much more testing. 
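
The polynomial case can be made concrete. Two polynomials of
order N that agree at N + 1 distinct points are identical,
because their difference is a polynomial of order at most N
with more than N roots. The sketch below, with functions
invented for illustration, shows that two tests cannot tell
a specified quadratic from a faulty one, while a third test
can.

```python
def spec(x):
    """Specified behavior: a polynomial of order 2."""
    return x * x


def faulty(x):
    """A different polynomial of order 2 that agrees with spec at x = 1 and x = 2."""
    return x * x + (x - 1) * (x - 2)


def agrees_on(points):
    """Return True if the faulty version matches the specification at every point."""
    return all(spec(x) == faulty(x) for x in points)


if __name__ == "__main__":
    print(agrees_on([1, 2]))     # two tests: the fault goes unnoticed
    print(agrees_on([1, 2, 3]))  # three tests: the fault is exposed
```

The argument depends entirely on knowing the order of the
polynomial in advance; no comparable bound is available for
a function with an arbitrary number of discontinuities.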

Built-in test circuitry is often included in hardware 
to perform testing while the product is in use. Predeter- 
mined values are substituted for inputs, and the out- 
puts are compared to normative values. Sometimes this 
approach is imitated in software designs and the claim 
is made that built-in online testing can substitute for 
black box testing. In hardware, built-in testing tests for 
decay or damage. Software does not decay and physical 
damage is not our concern. Software can be used to test 
the hardware, but its value for testing itself is quite 
doubtful. Software self-testing does increase the com- 
plexity of the product and, consequently, the likelihood 
of error. Moreover, such testing does not constitute ade- 
quate testing because it usually does not resemble the 
conditions of actual use. 

The fundamental limitations on testing mentioned 
earlier have some very practical implications. 

We cannot test software for correctness: Because of the 
large number of states (and the lack of regularity in its 
structure), the number of states that would have to be 
tested to assure that software is correct is preposterous. 
Testing can show the presence of bugs, but, except for 
toy problems, it is not practical to use testing to show 
that software is free of design errors. 
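
A rough calculation shows the scale of the problem. The
sketch below assumes a program whose result depends only on
its current input (no internal memory) and a hypothetical
test rig that executes a fixed number of tests per second;
both assumptions are optimistic.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def exhaustive_test_years(input_bits, tests_per_second):
    """Years needed to try every input of the given width exactly once."""
    cases = 2 ** input_bits
    return cases / (tests_per_second * SECONDS_PER_YEAR)


if __name__ == "__main__":
    # One 32-bit input, then two 32-bit inputs, at an assumed 10**9 tests per second.
    print(exhaustive_test_years(32, 10**9))
    print(exhaustive_test_years(64, 10**9))
```

At the assumed rate, one 32-bit input can be covered in
seconds, but two 32-bit inputs already require several
centuries. When the program retains state between inputs,
sequences of inputs must be considered and the count grows
far faster.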

It is difficult to make accurate predictions of software relia- 
bility and availability: Mathematical models show that 
it is practical to predict the reliability of software, pro- 
vided that one has good statistical models of the actual 
operating conditions. Unfortunately, one usually gains 
that information only after the system is installed. Even 
when a new system replaces an existing one, differ- 
ences in features may cause changes in the input distri- 
bution. Nonetheless, in safety-critical situations, one 
must attempt to get and use the necessary statistical 
data. The use of this data is discussed later in this
article. 

Predictions of availability are even more difficult; 
estimates of availability depend on predictions of the 
time it will take to correct a bug in the software. We 
never know what that amount of time will be in ad- 
vance; data from earlier bugs is not a good predictor of 
the time it will take to find the next bug. 

It is not practical to measure the trustworthiness of soft- 
ware: We consider a product to be trustworthy if we 
believe that the probability of it having a potentially 
catastrophic flaw is acceptably low. Whereas reliability 
is a measure of the probability of a problem occurring 
while the system is in service, trustworthiness is a 
measure of the probability of a serious flaw remaining 
after testing and review. In fact, inspection and testing 
can increase the trustworthiness of a product without 
affecting its reliability. 

Software does not need to be correct in order to be 
trustworthy. We will trust imperfect software if we be- 
lieve its probability of having a serious flaw is very low. 
Unfortunately, as we will show, the amount of testing 
necessary to establish high confidence levels for most 
software products is impractically large. The number of 
states and possible input sequences is so large that the 
probability of an error having escaped our attention 
will remain high even after years of testing. Methods 
other than testing must be used to increase our trust in 
software. 

There is a role for testing: A number of computer scien- 
tists, aware of the limitations on software testing, would 
argue that one should not test software. They would 
argue that the effort normally put into testing should, 
instead, be put into a form of review known as mathe- 
matical verification. A program is a mathematical ob- 
ject and can be proven correct. Unfortunately, such 
mathematical inspections are based on mathematical 
models that may not be accurate. No amount of mathe- 
matical analysis will reveal discrepancies between the 
model being used and the real situation; only testing 
can do that. Moreover, errors are often made in proofs. 
In mature engineering fields, mathematical methods 
and testing are viewed as complementary and mutually 
supportive. 

There is a need for an independent validation agency: It is 
impossible to test software completely and difficult to 
test one's own design in an unbiased way. A growing 
number of software development projects involve 
independent verification and validation (V&V). The 
V&V contractor is entirely independent of the devel- 
opment contractor. Sometimes a competitor of the 
development contractor is given the V&V contract. The 
testers work from the specification for the software and 
attempt to develop tests that will show the software to 
be faulty. One particularly interesting variation of this 
approach has been used within the IBM Federal Systems
Division. In IBM's clean room development approach,
the authors of the software are not allowed
to execute their programs. All testing is done by an
independent tester and test reports are sent to the 
developer's supervisors. The test cases are chosen using 
random number generators and are intended to yield 
statistically valid data. It was hypothesized that the 
software would be written far more carefully under 
these conditions and would be more reliable. Early 
reports support the hypothesis [3, 9, 12]. 

It is important that these validation tests not be made 
available to the developers before the software is sub- 
mitted for testing. If the developers know what tests 
will be performed, they will use those tests in their 
debugging. The result is likely to be a program that will 
pass the tests but is not reliable in actual use. 
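
Random selection of test cases by an independent tester
might be organized along the following lines. This is a
sketch only and does not describe IBM's actual procedure;
the operational profile, its frequencies, and the function
name are assumptions made for illustration. The seed is held
by the tester, so the developers cannot know in advance
which cases will be drawn, while the tester can reproduce a
run exactly.

```python
import random

# Hypothetical operational profile: input class -> assumed relative frequency.
OPERATIONAL_PROFILE = {"normal": 0.90, "startup": 0.07, "sensor_fault": 0.03}


def draw_test_classes(n_tests, seed):
    """Select n_tests input classes at random according to the assumed profile."""
    rng = random.Random(seed)  # the seed stays with the independent tester
    classes = list(OPERATIONAL_PROFILE)
    weights = list(OPERATIONAL_PROFILE.values())
    return rng.choices(classes, weights=weights, k=n_tests)


if __name__ == "__main__":
    sample = draw_test_classes(1000, seed=12345)
    for name in OPERATIONAL_PROFILE:
        print(name, sample.count(name))
```

Because the cases are drawn from a distribution rather than
hand-picked, the results can be treated as a statistical
sample, provided the assumed profile resembles actual use.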

SOFTWARE REVIEWABILITY CONCERNS 

Why is reviewability a particular concern for software? 

Traditionally, engineers have approached software 
as if it were an art form. Each programmer has been 
allowed to have his own style. Criticisms of software 
structure, clarity, and documentation were dismissed as 
"matters of taste." 

In the past, engineers were rarely asked to examine 
a software product and certify that it would be trust- 
worthy. Even in systems that were required to be 
trustworthy and reliable, software was often regarded 
as an unimportant component, not requiring special 
examination. 

In recent years, however, manufacturers of a wide 
variety of equipment have been substituting computers 
controlled by software for a wide variety of more con- 
ventional products. We can no longer treat software as 
if it were trivial and unimportant. 

In the older areas of engineering, safety-critical com- 
ponents are inspected and reviewed to assure the de- 
sign is consistent with the safety requirements. To 
make this review possible, the designers are required to 
conform to industry standards for the documentation, 
and even the structure, of the product. The documenta- 
tion must be sufficiently clear and well organized that 
a reviewer can determine whether or not the design 
meets safety standards. The design itself must allow 
components to be inspected so the reviewer can verify 
they are consistent with the documentation. In con- 
struction, inspections take place during the process — 
while it is still possible to inspect and correct work that 
will later be hidden. 

When software is a safety-critical component, analo- 
gous standards should be applied. In software, there is 
no problem of physical visibility but there is a problem 
of clarity. Both practical experience and planned exper- 
iments have shown that it is common for programs 
with major flaws to be accepted by reviewers. In one 
particularly shocking experiment, small programs were 
deliberately flawed and given to a skilled reviewer 
team. The reviewers were unable to find the flaws in 
spite of the fact that they were certain such flaws were
present. In theory, nothing is invisible in a program— 
it is all in the listing; in practice, poorly structured
programs hide a plethora of problems. 

In safety-critical applications we must reject the 
"software-as-art-form" approach. Programs and docu- 
mentation must conform to standards that allow re- 
viewers to feel confident they understand the software 
and can predict how it will function in situations 
where safety depends on it. However, we must, equally 
strongly, reject standards that require a mountain of 
paper that nobody can read. The standards must ensure
clear, precise, and concise documentation. 

It is symptomatic of the immaturity of the software 
profession that there are no widely accepted software 
standards assuring the reviewability essential to licen- 
sing of software products that must be seen as trust- 
worthy. The documentation standards name and out- 
line certain documents, but they only vaguely define 
the contents of those documents. Recent U.S. military 
procurement regulations include safety requirements; 
while they require that safety checks be done, they 
neither describe how to do them nor impose standards 
that make those checks practicable. Most standards 
for code documentation are so vague and syntactic in 
nature that a program can meet those standards in spite 
of being incomprehensible. 

In the next section we derive some basic standards 
by considering the reviews that are needed and the 
information required by the reviewers. 

What reviews are needed? 

Software installed as a safety-critical component in 
a large system should be subjected to the following 
reviews: 

a. Review for correct intended function. If the software 
works as the programmers intend, will it meet the 
actual requirements? 

b. Review for maintainable, understandable, well-documented
structure. Is it easy to find portions of the
software relevant to a certain issue? Are the respon- 
sibilities of the various modules clearly defined? If 
all of the modules work as required, will the whole 
system work as intended? If changes are needed in 
the future, can those changes be restricted to easily 
identified portions of the code? 

c. Review each module to verify the algorithm and data 
structure design are consistent with the specified 
behavior. Is the data structure used in the module 
appropriate for representing the information main- 
tained by that module? If the programs are correctly 
coded, will the modules perform as required? Will 
the algorithms selected perform as required? These 
reviews must use mathematical methods; one can- 
not rely on intuitive approaches. We have found a 
formal review based on functional semantics [10]
to be practical and effective. 

d. Review the code for consistency with the algorithm 
and data structure design. Is the actual source code 
consistent with the algorithms and data structures 
described by the designers? Have the assemblers,
compilers, and other support tools been used
correctly? 

e. Review test adequacy. Was the testing sufficient to 
provide sound confidence in the proper functioning 
of the software? 

The structure of this set of reviews is consistent with 
modern approaches to software engineering. Because 
we are unable to comprehend all the critical details 
about a software product at once, it is necessary to 
provide documentation that allows programmers and 
reviewers to focus on one aspect at a time and to zoom 
in on the relevant details. 

Developing and presenting these views in the se- 
quence listed is the analogue of providing inspections 
during a construction project. Just as construction is 
inspected before further work obscures what has been 
done, the early specifications should be reviewed be- 
fore subsequent coding hides the structure in a sea of 
detail. 

The set of reviews also reflects the fact that review- 
ers of a software product have a variety of skills. Those 
who have a deep understanding of the requirements 
are not usually skilled software designers. It follows 
that the best people to review the functional behavior 
of the software are not the ones who should study the 
software. Similarly, within the software field we have 
people who are good at algorithm design, but not particularly
good at finding an architecture for software products.
Skilled algorithm designers are not necessarily
experts on a particular compiler or machine language. 
Those intimately familiar with a compiler or assembly 
language are not always good at organizing large pro- 
grams. When the software is safety critical, it is impor- 
tant that each of the five reviews be conducted by 
those best qualified to review that aspect of the 
work. 

Within this framework, all code and documentation 
supplied must be of a quality that facilitates review and 
allows the reviewers to be confident of their conclu- 
sions. It is the responsibility of the designers to present 
their software in a way that leaves no doubt about its
correctness. It is not the responsibility of the reviewers 
to guess the designers' intent. Discrepancies between 
code and documentation must be treated as seriously as 
errors in the code. If the designers are allowed to be 
sloppy with their documentation, quality control will 
be ineffective. 

In the following sections of this article, we will de- 
scribe the documentation that must be provided for 
each of these reviews. This documentation should not 
be created merely for review purposes. It should be 
used throughout the development to record and propa- 
gate design decisions. When separate review documents 
are produced, projects experience all the problems of 
keeping two sets of books. Because of the complexity of 
software products, it is unlikely that both records 
would be consistent. Moreover, the documents de- 
scribed below from the reviewers' viewpoint are inval- 
uable to the designers as well [5, 13, 16]. 



What documentation is required to review the functional 
requirements? 

The software can be viewed as a control system 
whose output values respond to changes in the states 
of variables of interest in its environment. For many 
real-time systems, the desired outputs approximate 
piece-wise continuous functions of time and the history 
of the relevant environmental parameters. For other 
systems, the outputs are functions of a snapshot of the 
environmental parameters taken at some point in time. 
Some systems provide both reports and continuous 
outputs. 

The reviewers at this stage should be engineers and 
scientists who understand the situation being moni- 
tored and the devices to be controlled. They may not be 
computer specialists and should not be expected to 
read and understand programs. Because the require- 
ments could, in theory, be fulfilled by a completely 
hardware design, the description should use the mathe- 
matics of control systems, not the jargon and notation 
of computer programming. The functional requirements
can be stated precisely by giving three mathematical
relations: (1) the required values of the controlled
environmental variables in terms of the values of the
relevant observable environmental parameters, (2) the
computer inputs in terms of those observable
environmental variables, and (3) the values of the
controlled environmental variables in terms of the
computer outputs.
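
The three relations can be illustrated with a toy heater
controller, sketched below in Python. Every name, unit, and
threshold is an assumption made for illustration; a real
requirements document would give these relations as tables
and formulae, not as program text. The function software
stands for a candidate implementation, and the final check
asks whether input relation, software, and output relation
together satisfy the required relation.

```python
def required(room_temp_c):
    """Relation 1: required heater state in terms of the observed temperature."""
    return "on" if room_temp_c < 18.0 else "off"


def input_relation(room_temp_c):
    """Relation 2: assumed sensor, reporting temperature in tenths of a degree."""
    return round(room_temp_c * 10)


def output_relation(heater_bit):
    """Relation 3: assumed actuator, where output bit 1 switches the heater on."""
    return "on" if heater_bit == 1 else "off"


def software(sensor_value):
    """Candidate program: maps the computer input to the computer output."""
    return 1 if sensor_value < 180 else 0


def meets_requirements(temps):
    """Check the composed behavior against the required relation at each point."""
    return all(
        output_relation(software(input_relation(t))) == required(t) for t in temps
    )


if __name__ == "__main__":
    tenths = [t / 10 for t in range(0, 400)]
    print(meets_requirements(tenths))
    print(meets_requirements([17.96]))
```

The second check fails: the assumed sensor rounds 17.96
degrees up to 180 tenths, so the heater stays off although
the required relation demands that it be on. Discrepancies
of this kind between the stated relations are what the
requirements review is intended to expose.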

These requirements can be communicated as a set of 
tables and formulae describing the mathematical func- 
tions to be implemented [4]. We should not describe a 
sequence of computations anywhere in this document. 
The use of natural language, which inevitably intro- 
duces ambiguity, should be minimized. Documents of 
this form have been written for reasonably complex 
systems and are essential when safety-critical functions 
are to be performed. Our experience has shown that 
documents written this way can be thoroughly and 
effectively reviewed by engineers who are not program- 
mers. Some suggestions for organizing the reviews are 
contained in [19]. A complete example of such a 
document has been published as a model for other 
projects [17]. 

What documentation is required to review the software 
structure? 

For this review we require documents that describe 
the breakdown of the program into modules. Each mod- 
ule is a unit that should be designed, written and re- 
viewed independently of other modules. Each module 
is a collection of programs; the programs that can be 
invoked from other modules are called access programs.
The purpose of this review is to make sure that:
(1) the structure is one that allows independent development
and change; (2) all programs that are needed
are included once and only once in the structure;
(3) the interfaces to the modules are precisely defined;
and (4) the modules are compatible and will, together,
constitute a system that meets the functional
requirements.

For this review three types of documents are re- 
quired. The first is the requirements specification, 
which should have been approved by an earlier review. 
The second is an informal document describing the re- 
sponsibilities of each module. The purpose of this mod- 
ule guide is to allow a reviewer to find all the modules 
relevant to a particular aspect of system design [1]. The 
third type of document is known as a module specifica- 
tion. It provides a complete black box description of the 
module interface. There should be one specification for 
each module mentioned in the module guide [2, 14]. 

Reviewers of these documents must be experienced 
software engineers. Some of them should have had ex- 
perience with similar systems. This experience is nec- 
essary to note omissions in the module structure. Dis- 
cussions of these documents and how to organize the 
reviews are contained in [14, 19]. 

What documentation is required to review the module's 
internal design? 

The first step in designing the module should be to 
describe the data structures that will be used and each 
proposed program's effect on the data. This information 
can be described in a way that is, except for the data 
types available, independent of the programming lan- 
guage being used. 

The design documentation is a description of two 
types of mathematical functions: program functions and 
abstraction functions. This terminology was used in 
IBM's Federal Systems Division, the IBM branch re- 
sponsible for U.S. Government systems. These concepts 
are described more fully elsewhere [11, 13]. The pro- 
gram functions, one for each module access program, 
give the mapping from the state before the program is 
executed to the state after the program terminates. The 
abstraction functions are used to define the "meaning" 
of the data structure; they give the mapping between 
the data states and abstract values visible to the users 
of the module. It is well-known that these functions 
provide sufficient information for a formal review of 
correctness of the design before the programs are 
implemented. 
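
As an illustration only (it is not taken from the documents cited above), the following Python sketch shows the two kinds of functions for a hypothetical bounded-stack module. The names `State`, `abstraction`, `push_function`, and `push_spec`, and the choice of a stack, are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Tuple

CAPACITY = 4  # hypothetical fixed size of the concrete array


@dataclass(frozen=True)
class State:
    """Concrete data state: a fixed-size array plus a fill count."""
    cells: Tuple[int, ...]  # always CAPACITY entries; unused ones are junk
    count: int              # number of valid entries, 0..CAPACITY


def abstraction(s: State) -> Tuple[int, ...]:
    """Abstraction function: the stack value visible to users (bottom to top).

    Cells at index >= count do not affect the abstract value.
    """
    return s.cells[:s.count]


def push_function(s: State, x: int) -> State:
    """Program function for the access program PUSH.

    Maps the state before execution to the state after termination.
    In this sketch a push on a full stack leaves the state unchanged.
    """
    if s.count == CAPACITY:
        return s
    cells = s.cells[:s.count] + (x,) + s.cells[s.count + 1:]
    return State(cells, s.count + 1)


def push_spec(abstract: Tuple[int, ...], x: int) -> Tuple[int, ...]:
    """Black-box behavior of PUSH, stated on abstract values only."""
    return abstract if len(abstract) == CAPACITY else abstract + (x,)


def design_is_correct_for(s: State, x: int) -> bool:
    """Check abstraction(push_function(s, x)) == push_spec(abstraction(s), x)."""
    return abstraction(push_function(s, x)) == push_spec(abstraction(s), x)
```

A reviewer would establish the equality in `design_is_correct_for` by argument for all states and inputs; the function merely states what has to be shown.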

Programs that cannot be described on a single page 
must be presented in a hierarchical way; each page 
must present a small program, calling other programs 
whose functions are specified on that page. This type of 
presentation allows the algorithm to be understood and 
verified one page at a time. 

If the module embodies a physical model (i.e., a set of 
equations that allows us to compute nonobservables 
from observables), the model must be described and its 
limitations documented. 

If the module performs numerical calculations in 
which accuracy will be a concern, numerical analysis 
justifying the design must be included. 

If the module is hardware-dependent, the documentation
must include either a description of the hardware
or a reference to such a description.

If the module is responsible for certain parts of the
functional specification, a cross reference must be
provided.

The reviewers of each internal module design docu- 
ment will include experienced software engineers and 
other specialists. For example, if a physical model is 
involved, a physicist or engineer with expertise in that 
area must be included as a reviewer. If the information 
is presented in a notation that is independent of the 
programming language, none of the reviewers needs to 
be an expert in the programming language involved. 
Numerical analysts will be needed for some modules, 
device specialists for others. 

What documentation is required to review the code? 

While it is important that the algorithms and data 
structures be appropriate to the task, this will be of 
little help if the actual code is not faithful to the ab- 
stract design. Because of the previous reviews, those 
who review the code do not need to examine the global 
design of the system. Instead, they examine the corre- 
spondence between the algorithms and the actual code. 
These reviewers must be experienced users of the 
hardware and compilers involved; of course, they 
must also understand the notation used to specify 
the algorithms. 

What documentation is required for the Test Plan Review?

Although these reviews, if carried out rigorously, 
constitute a mathematical verification of the code, 
testing is still required. Sound testing requires that 
a test plan (a document describing the way test cases 
will be selected) be developed and approved in ad- 
vance. In addition to the usual engineering practice of 
normal case and limiting case checks, it is important 
that the reliability of safety-critical systems be esti- 
mated by statistical methods. Reliability estimation re- 
quires statistically valid random testing; careful thought 
must be given to the distribution from which the test 
cases will be drawn. It is important for the distribution 
of inputs to be typical of situations in which the correct 
functioning of the system is critical. A more detailed 
discussion of statistical testing can be found in the upcoming
section, Reliability Assessment for Safety-Critical
Software.

The test plan should be described in a document that 
is not available to the designers. It should be reviewed 
by specialists in software testing, and specialists in the 
application area, who compare it with the require- 
ments specification to make certain the test coverage 
is adequate. 

Reviewing the relationship between these documents 

The hierarchical process described is designed to al- 
low reviews to be conducted in an orderly way, focus- 
ing on one issue at a time. To make this "separation of 
concerns" work, it is important that the required rela- 
tionships between the documents be verified. 

a. The module guide must show clearly that each of
the mathematical functions described in the requirements
specification is the responsibility of a
specific module. There must be no ambiguity about
the responsibilities of the various modules. The
module specifications must be consistent with the
module guide and the requirements specification.

b. Each module design document should include argu- 
mentation showing that the internal design satisfies 
the module specification. If the module specification 
is mathematical [18], mathematical verification of 
the design correctness is possible [11]. 

c. The module design document, which describes the 
algorithms, must be clearly mapped onto the code. 
The algorithms may be described in an abstract no- 
tation or via hierarchically structured diagrams. 

d. The test plan must show how the tests are derived 
and how they cover the requirements. The test plan 
must include black box module tests as well as 
black box system tests. 
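
Part of relationship (a) lends itself to a mechanical check. The sketch below assumes, purely for illustration, that the module guide has been transcribed into a dictionary from module names to the requirement functions each module is responsible for; `check_module_guide` and the data layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def check_module_guide(
    required_functions: Iterable[str],
    guide: Dict[str, List[str]],
) -> Tuple[List[str], List[str]]:
    """Return (unassigned, multiply_assigned) requirement functions.

    Both lists are empty when every required function is the
    responsibility of exactly one module.
    """
    counts = Counter(f for funcs in guide.values() for f in funcs)
    required = list(required_functions)
    unassigned = sorted(f for f in required if counts[f] == 0)
    multiply_assigned = sorted(f for f in required if counts[f] > 1)
    return unassigned, multiply_assigned
```

Such a check finds only omissions and ambiguities of assignment; whether a module specification is consistent with the requirements still has to be judged by the reviewers.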

Why is configuration management essential for rigorous 
reviews? 

Because of the complexity of software, and the 
amount of detail that must be taken into consideration, 
there is always a tremendous amount of documenta- 
tion. Some of the most troublesome software errors 
occur when documents are allowed to get out-of-date 
while their authors work with pencil notes on their 
own copies. 

For the highly structured review process outlined 
earlier to succeed, all documents must be kept consis- 
tent when changes are made. If a document is changed, 
it, and all documents related to it, must be reviewed 
again. A careful review of the software may take weeks 
or months. Each reviewer must be certain that the doc- 
uments given to him are consistent and up-to-date. The 
time and energy of reviewers should not be wasted
comparing different versions of the same document.

A process known in the profession as configuration 
management, supported by a configuration control 
mechanism, is needed to ensure that every designer 
and reviewer has the latest version of the documents 
and is informed of every change in a document that 
might affect the review. 

We should be exploiting computer technology to 
make sure that programmers, designers, and reviewers 
do not need to retain paper copies of the documents at 
all. Instead, they use online documentation. If a change 
must be made, all who have used the affected docu- 
ment should be notified of the change by the computer 
system. When a change is being considered, but is not 
yet approved, users of the document should receive a 
warning. The online versions must be kept under strict 
control so they cannot be changed without authoriza- 
tion. Every page must contain a version identifier that 
makes it easier for a reviewer to verify that the docu- 
ments he has used represent a consistent snapshot. 
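
A minimal sketch of the snapshot check, assuming each document carries a single version identifier string; the function name and the dictionary layout are hypothetical and do not describe any particular configuration control tool.

```python
from typing import Dict, List


def stale_documents(used: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Names of documents whose reviewed version is not the current one.

    `used` maps document name to the version identifier the reviewer read;
    `current` maps document name to the version under configuration control.
    """
    return sorted(name for name, version in used.items()
                  if current.get(name) != version)
```

An empty result means the reviewer worked from a consistent, up-to-date snapshot; any name returned identifies a document that must be reviewed again.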

MODULAR STRUCTURE 

Modern software engineering standards call for software
to be organized in accordance with a principle
known variously as "Information Hiding," "Object-Oriented
Programming," "Separation of Concerns,"
"Encapsulation," "Data Abstraction," etc. This principle
is designed to increase the cohesion of the modules
while reducing the "coupling" between modules. Several
new textbooks, well-known programming languages
such as ADA, and practical languages such as MESA,
PROTEL, and MODULA are designed to support such
an organization.

Any large program must be organized into pro- 
grammer work assignments known as modules. In 
information-hiding designs, each module hides a secret, 
a fact, or closely related set of facts, about the design 
that does not need to be known by the writers and 
reviewers of other modules. Each work assignment be- 
comes much simpler than in an old-fashioned design 
because it can be completed and understood without 
knowing much about the other modules. When changes 
are needed, they do not ripple through an unpredicta- 
ble number of other modules, as they frequently do in 
more conventional software designs. 
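
The idea can be sketched with a small hypothetical module whose secret is the way raw device counts are converted to degrees. The class name, the calibration constants, and the client function are assumptions made for this illustration.

```python
from typing import Callable


class TemperatureSensor:
    """Hypothetical module; its secret is how raw counts map to degrees."""

    # Secret: calibration constants of the device currently installed.
    _OFFSET_COUNTS = 400
    _COUNTS_PER_DEGREE = 16

    def __init__(self, read_raw: Callable[[], int]) -> None:
        # read_raw is a caller-supplied function returning an integer count.
        self._read_raw = read_raw

    def degrees_celsius(self) -> float:
        """Access program: the only operation other modules may use."""
        return (self._read_raw() - self._OFFSET_COUNTS) / self._COUNTS_PER_DEGREE


def overheated(sensor: TemperatureSensor, limit_celsius: float) -> bool:
    """A client written against the interface, not the calibration details."""
    return sensor.degrees_celsius() > limit_celsius
```

If the device is replaced by one with a different calibration, only `TemperatureSensor` changes; `overheated` and its reviewers are unaffected.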

A number of practical systems illustrate the benefits 
of information hiding even when the designers did not 
use that abstract principle but depended on their own 
intuition. For example, the widely used UNIX operating 
system gains much of its flexibility from hiding the 
difference between files and devices. 

The thought of hiding information from others often 
strikes engineers as unnatural and wrong. In engineer- 
ing projects, careful scrutiny by others working on the 
project is considered an important part of quality con- 
trol. However, information hiding occurs naturally in 
large multidisciplinary projects. An electrical engineer 
may use a transformer without understanding its mo- 
lecular structure or knowing the size of the bolts that 
fasten it to a chassis. The circuit designer works with a 
specification that specifies such abstractions as voltage 
ratio, hysteresis curve, and linearity. Designers of large 
mechanical structures work with abstract descriptions 
of the girders and other components, not with the 
detailed molecular structures that are the concern of 
materials engineers. Large engineering projects would 
be impossible if every engineer on the project had to be 
familiar with all the details of every component of the 
product. 

Large software projects have the complexity of huge 
multidisciplinary projects, but there is only one dis- 
cipline involved. Consequently, information hiding 
does not occur naturally and must be introduced as an 
engineering discipline. Software engineers should be 
trained to provide and use abstract mathematical speci- 
fications of components just as other engineers do. 

The criterion of information hiding does not deter- 
mine the software structure. Software engineers try to 
minimize the information that one programmer must 
have about another's work. They also try to minimize 
the expected cost of a system over the period of its use. 
Both information and expected cost are probabilistic 
measures. For maximum benefit, one should hide those
details most likely to change but does not need to hide
facts that are fundamental and unlikely to change. Fur- 
ther, decisions likely to be changed and reviewed to- 
gether should be hidden in the same module. This im- 
plies that to apply the principle, one must make 
assumptions about the likelihood of various types of 
changes. If two designers apply the information-hiding 
principle, but make different assumptions about the 
likelihood of changes, they will come up with different 
structures. 

RELIABILITY ASSESSMENT FOR 
SAFETY-CRITICAL SOFTWARE 

Should we discuss the reliability of software at all?

Manufacturers, users, and regulatory agencies are 
often concerned about the reliability of systems that 
include software. Over many decades, reliability engi- 
neers have developed sophisticated methods of estimat- 
ing the reliability of hardware systems based upon esti- 
mates of the reliability of their components. Software is 
often viewed as one of those components and an esti- 
mate of the reliability of that component is deemed 
essential to estimating the reliability of the overall 
system. 

Reliability engineers are often misled by their experi- 
ence with hardware. They are usually concerned with 
the reliability of devices that work correctly when new, 
but wear out and fail as they age. In other cases, they 
are concerned with mass-produced components where 
manufacturing techniques introduce defects that affect 
only a small fraction of the devices. Neither of these 
situations applies to software. Software does not wear 
out, and the errors introduced when software is copied 
have not been found to be significant. 

As a result of these differences, it is not uncommon 
to see reliability assessments for large systems based on 
an estimated software reliability of 1.0. Reliability engi- 
neers argue that the correctness of a software product is 
not a probabilistic phenomenon. The software is either 
correct (reliability 1.0) or incorrect (reliability 0). If 
they assume a reliability of 0, they cannot get a useful 
reliability estimate for the system containing the 
software. Consequently, they assume correctness. 
Many consider it nonsense to talk about "reliability 
of software." 

Nonetheless, our practical experience is that software 
appears to exhibit stochastic properties. It is quite use- 
ful to associate reliability figures such as MTBF (Mean 
Time Between Failures) with an operating system or 
other software product. Some software experts attribute 
the apparently random behavior to our ignorance. They 
believe that all software failures would be predictable if 
we fully understood the software, but our failure to 
understand our own creations justifies the treatment of 
software failures as random. However, we know that if 
we studied the software long enough, we could obtain a 
complete description of its response to inputs. Even 
then, it would be useful to talk about the MTBF of the
product. Hence, ignorance should not satisfy us as a
philosophical justification.

When a program first fails to function properly, it is 
because of an input sequence that had not occurred 
before. The reason that software appears to exhibit ran- 
dom behavior, and the reason that it is useful to talk 
about the MTBF of software, is because the input se- 
quences are unpredictable. When we talk about the 
failure rate of a software product, we are predicting the 
probability of encountering an input sequence that will 
cause the product to fail. 

Strictly speaking, we should not consider software as 
a component in systems at all. The software is simply 
the initial data in the computer and it is the initialized 
computer that is the component in question. However, 
in practice, the reliability of the hardware is high and 
failures caused by software errors dominate those 
caused by hardware problems. 

What should we be measuring? 

What we intuitively call "software reliability" is the 
probability of not encountering a sequence of inputs 
that leads to failure. If we could accurately characterize 
the sequences that lead to failure we would simply 
measure the distribution of input histories directly. Be- 
cause of our ignorance of the actual properties of the 
software, we must use the software itself to measure 
the frequency with which failure-inducing sequences 
occur as inputs. 

In safety-critical applications, particularly those for 
which a failure would be considered catastrophic, we 
may wish to take the position that design errors that 
would lead to failure are always unacceptable. In other 
technologies we would not put a system with a known 
design error in service. The complexity of software, and 
its consequent poor track record, means we seldom 
have confidence that software is free of serious design 
errors. Under those circumstances, we may wish to 
evaluate the probability that serious errors have been 
missed by our tests. This gives rise to our second proba- 
bilistic measure of software quality, trustworthiness. 

In the sequel we shall refer to the probability that an 
input will not cause a failure as the reliability of the 
software. We shall refer to the probability that no seri- 
ous design error remains after the software passes a set 
of randomly chosen tests as the trustworthiness of the 
software. We will discuss how to obtain estimates of 
both of these quantities. 

Some discussions about software systems use the 
terms availability and reliability as if they were inter- 
changeable. Availability usually refers to the fraction of 
time that the system is running and assumed to be 
ready to function. Availability can depend strongly on 
the time it takes to return a system to service once it 
has failed. If a system is truly safety-critical (e.g., a 
shutdown system in a nuclear power station), we would 
not depend on it during the time it was unavailable. 
The nuclear reactor would be taken out of service 
while its shutdown system was being repaired. Consequently,
reliability and availability can be quite
different.

For systems that must function correctly only in rare
emergencies, we wish to measure the reliability in
those situations where the system must take corrective 
action, and not include data from situations in which 
the system is not needed. The input sequence distribu- 
tions used in reliability assessment should be those that 
one would encounter in emergency situations, and not 
those that characterize normal operation. 

Much of the literature on software reliability is con- 
cerned with estimation and prediction of error-rates, 
the number of errors per line of code. For safety pur- 
poses, such rates are both meaningless and unimpor- 
tant. Error counts are meaningless because we cannot 
find an objective way to count errors. We can count the 
number of lines in the code that are changed to elimi- 
nate a problem, but there usually are many ways to 
alleviate that problem. If each approach to repairing the 
problem involves a different number of lines (which is 
usually the case), the number of errors in the code is a 
subjective, often arbitrary, judgment. Error counts are 
unimportant because a program with a high error count 
is not necessarily less reliable than one with a low 
error count. In other words, even if we could count the 
number of errors, reliability is not a function of the 
error count. If asked to evaluate a safety-critical soft- 
ware product, there is no point in attempting to esti- 
mate or predict the number of errors remaining in a 
program. 

Other portions of the literature are concerned with 
reliability growth models. These attempt to predict 
the reliability of the next (corrected) version on the 
basis of reliability data collected from previous ver- 
sions. Most assume the failure rate is reduced when- 
ever an error is corrected. They also assume the reduc- 
tions in failure rates resulting from each correction are 
predictable. These assumptions are not justified by 
either theoretical or empirical studies of programs. Re- 
liability growth models may be useful for management 
and scheduling purposes, but for safety-critical applica- 
tions one must treat each modification of the program 
as a new program. Because even small changes can 
have major effects, we should consider data obtained 
from previous versions of the program to be irrelevant. 

We cannot predict a software failure rate from failure rates 
for individual lines or subprograms. 

The essence of system-reliability studies is the com- 
putation of the reliability of a large system when given 
the reliability of the parts. It is tempting to try to do the 
same thing for software, but the temptation should be 
resisted. The lines or statements of a program are not 
analogous to the components of a hardware system. 
The components of a hardware system function inde- 
pendently and simultaneously. The lines of a computer 
program function sequentially and the effect of one ex- 
ecution depends on the state that results from the ear- 
lier executions. One failure at one part of a program 
may lead to many problems elsewhere in the code.
When evaluating the reliability of a safety-critical
software product, the only sound approach is to treat
the whole computer, hardware and software, as
a black box.

The finite state machine model of programs

The following discussion is based on the simplest and 
oldest model of digital computing. Used for more than 
50 years, this model recognizes that every digital com- 
puter has a finite number of states and there are only a 
finite number of possible input and output signals at 
any moment in time. Each machine is described by two 
functions: next-state, and output. Both have a domain
consisting of (state, input) pairs. The range of the next- 
state function is the set of states. The range of the 
output function is a set of symbols known as the output 
alphabet. These functions describe the behavior of a 
machine that starts in a specified initial state and pe- 
riodically selects new states and outputs in accordance 
with the functions. 
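
The model can be made concrete with a very small machine. The states, inputs, and outputs below describe a hypothetical shutdown controller invented for this sketch; only the structure (two functions over (state, input) pairs and a specified initial state) comes from the model described above.

```python
from typing import Dict, List, Tuple

State = str
Input = str
Output = str

# next-state and output functions as tables over (state, input) pairs.
NEXT_STATE: Dict[Tuple[State, Input], State] = {
    ("idle", "low"): "idle",
    ("idle", "high"): "tripped",
    ("tripped", "low"): "tripped",
    ("tripped", "high"): "tripped",
}
OUTPUT: Dict[Tuple[State, Input], Output] = {
    ("idle", "low"): "run",
    ("idle", "high"): "shutdown",
    ("tripped", "low"): "shutdown",
    ("tripped", "high"): "shutdown",
}


def run(initial: State, inputs: List[Input]) -> List[Output]:
    """Start in `initial` and apply both functions once per input symbol."""
    state = initial
    outputs: List[Output] = []
    for symbol in inputs:
        outputs.append(OUTPUT[(state, symbol)])
        state = NEXT_STATE[(state, symbol)]
    return outputs
```

For example, `run("idle", ["low", "high", "low"])` yields the outputs run, shutdown, shutdown: once tripped, this machine stays tripped.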

In this model, the software can be viewed as part of 
the initial data. It determines the initial state of the 
programmed machine. Von Neumann introduced a ma- 
chine architecture in which program and data could be 
intermixed. Practicing programmers know they can al- 
ways replace code with data or vice versa. It does not 
make sense to deal with the program and data as if they 
were different. 

In effect, loading a program in the machine selects a 
terminal submachine consisting of all states that can be 
reached from the initial state. The software can be 
viewed as a finite state machine described by two very 
large tables. This model of software allows us to define 
what we mean by the number of faults in the software; 
it is the number of entries in the table that specify 
behavior that would be considered unacceptable. This 
fault count has no simple relation to the number of 
errors made by the programmer or the number of state- 
ments that must be corrected to remove the faults. It 
serves only to help us to determine the number of tests 
that we need to perform. 
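
Under this definition the fault count can be computed, at least in principle, directly from the tables. The sketch below is our reading of the definition: it counts only entries for states reachable from the initial state, and it takes the judgment of what is unacceptable as a caller-supplied predicate. All names are hypothetical.

```python
from typing import Callable, Dict, Set, Tuple

Pair = Tuple[str, str]  # (state, input)


def reachable_states(next_state: Dict[Pair, str], initial: str) -> Set[str]:
    """States of the submachine selected by loading the program."""
    seen = {initial}
    frontier = [initial]
    while frontier:
        state = frontier.pop()
        for (src, _symbol), dst in next_state.items():
            if src == state and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen


def fault_count(
    next_state: Dict[Pair, str],
    output: Dict[Pair, str],
    initial: str,
    acceptable: Callable[[Pair, str, str], bool],
) -> int:
    """Number of reachable table entries that specify unacceptable behavior."""
    states = reachable_states(next_state, initial)
    return sum(
        1
        for pair in output
        if pair[0] in states
        and not acceptable(pair, next_state[pair], output[pair])
    )
```

For real software the tables are far too large to enumerate, which is why the count serves only as a parameter in the calculations that follow.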

Use of hypothesis testing 

In most safety-critical applications we do not need to 
know the actual probability of failure; we need to con- 
firm the failure probability is very likely to be below a 
specified upper bound. We propose to run random tests 
on the software, checking the result of each test. Since 
we are concerned with safety-critical software, if a test 
fails (i.e., reveals an error in the software), we will 
change the software in a way that we believe will cor- 
rect the error. We will again begin random testing. We 
will continue such tests until we have sufficient data to 
convince us that the probability of a failure is accepta- 
bly low. Because we can execute only a very small 
fraction of the conceivable tests, we can never be sure 
that the probability of failure is low enough. We can, 
however, calculate the probability that a product with 
unacceptable reliability would have passed the test that 
we have carried out. 



TABLE I. Probability That a System With Failure Probability of
.001 Will Pass N Successive Tests

h = 1000

    N     M = (1 - 1/h)^N
  500     0.60638
  600     0.54865
  700     0.49641
  800     0.44915
  900     0.40639
 1000     0.36770
 1500     0.22296
 2000     0.13520
 2500     0.08198
 3000     0.04971
 3500     0.03014
 4000     0.01828
 4500     0.01108
 4700     0.00907
 5000     0.00672


Let us assume the probability of a failure in a test
of a program is 1/h (i.e., the reliability is 1 - 1/h).
Assuming that N randomly selected tests (chosen, with
replacement, from a distribution that corresponds to
the actual usage of the program) are performed, the
probability there will be no failure encountered during
the testing is

$(1 - 1/h)^N = M.$ (1)

In other words, if we want the failure probability to 
be less than 1/h, and we have run N tests without
failure, the probability that an unacceptable product 
would pass our test is no higher than M. We must 
continue testing, without failure, until N is large 
enough to make M acceptably low. We could then 
make statements like, "the probability that a product 
with reliability worse than .999 would pass this test is 
less than one in a hundred." Table I provides some 
sample values of M for h = 1000 and various values
of N. 

Table I shows that, if our design target was to have 
the probability of failure be less than 1 in 1000, per- 
forming between 4500 and 5000 tests (randomly chosen 
from the appropriate test case distribution) without fail- 
ure would mean that the probability of an unacceptable 
product passing the test was less than 1 in a hundred. 
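
The bracket read from Table I can be checked by solving (1) for N. The derivation below uses only the definitions above; the approximation ln(1 - 1/h) ≈ -1/h is our addition and holds for large h.

```latex
\[
(1 - 1/h)^N \le M
\quad\Longleftrightarrow\quad
N \ge \frac{\ln M}{\ln(1 - 1/h)} \approx h \,\ln(1/M)
\]
\[
h = 1000,\ M = 0.01:\qquad
N \ge \frac{\ln 0.01}{\ln 0.999} \approx 4602.9
\]
```

Hence 4603 consecutive failure-free tests suffice, which lies between the Table I entries for N = 4500 and N = 5000.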

Because the probability of failure in practice is a 
function of the distribution of cases encountered in 
practice, the validity of this approach depends on the 
distribution of cases in the tests being typical of the 
distribution of cases encountered in practice. 

June 1990 Volume 33 Number 6    Communications of the ACM

We can consider using the same approach to obtain a
measure of the trustworthiness of a program. Let the
total number of cases from which we select tests be C.
Assume we consider it unacceptable if F of those cases
result in faulty behavior (F might be 1). By substituting
F/C for 1/h we obtain

$(1 - F/C)^N = M.$ (2)

TABLE II. Probability That a System With Failure Probability of
.000001 Will Pass N Successive Tests

We now assume that we have carried out N ran- 
domly selected tests without finding an error. If, during 
that testing, we had found an error, we would have 
corrected the problem and started again. We can esti- 
mate the value of C, and must determine whether to 
use F = 1 or some higher value. We might pick a higher 
number if we thought it unlikely that there would be 
only 1 faulty (state, input) pair. In most computer pro- 
grams, a programming error would result in many 
faulty pairs, and calculations using F = 1 are unneces- 
sarily pessimistic. After choosing F, we can determine 
M as above. (F, M) pairs provide a measure of trust- 
worthiness. Note that systems considered trustworthy 
would have relatively low values of M and F. 

As a result of such tests we could make statements 
like, "The probability that a program with more than 
five unacceptable cases would pass this test is one in a 
hundred." Since we are not concerned with the fre- 
quency of failure of those cases in practice, the tests 
should be chosen from a distribution in which all state 
input combinations are equally likely. Because C is al- 
most always large and F relatively small, it is not prac- 
tical to evaluate trustworthiness by means of testing. 
Trustworthiness, in the sense that we have defined it 
here, must be obtained by means of formal, rigorous 
inspections. 
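
To see the orders of magnitude involved, equation (2) can be solved for N. The values of C, F, and M below are purely hypothetical and are chosen only to illustrate the calculation.

```latex
\[
(1 - F/C)^N \le M
\quad\Longleftrightarrow\quad
N \ge \frac{\ln M}{\ln(1 - F/C)} \approx \frac{C}{F}\,\ln(1/M)
\]
\[
C = 10^{12},\ F = 5,\ M = 0.01:\qquad
N \approx 2 \times 10^{11} \times 4.605 \approx 9.2 \times 10^{11}
\]
```

Even with this modest assumed value of C, nearly a trillion failure-free tests would be needed, and realistic state spaces are far larger.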

It is common to try to achieve high reliability by 
using two or more programs in an arrangement that 
will be safe if one of their specified subsets fails. For 
example, one could have two safety systems and make 
sure that each one could alone take the necessary ac- 
tions in an emergency. If the system failures are statisti- 
cally independent, the probability of the joint system 
failing is the product of the probability of individual 
failures. Unfortunately, repeated experiments have 
shown that, even when the programs for the two sys- 
tems are developed independently, the failures are cor- 
related [6, 7, 8]. As a result, we should evaluate the 
probability of joint failure experimentally. 
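
The difference between the independence assumption and a measured joint failure rate can be sketched as follows. The function and the two failure predicates are hypothetical; the point is only that both programs are run on the same cases and the joint failures are counted directly.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Case = TypeVar("Case")


def joint_failure_estimate(
    cases: Iterable[Case],
    fails_a: Callable[[Case], bool],
    fails_b: Callable[[Case], bool],
) -> Tuple[float, float]:
    """Return (observed joint failure fraction, product of the two
    individual failure fractions) over the same set of test cases."""
    case_list = list(cases)
    n = len(case_list)
    a = sum(1 for c in case_list if fails_a(c))
    b = sum(1 for c in case_list if fails_b(c))
    both = sum(1 for c in case_list if fails_a(c) and fails_b(c))
    return both / n, (a / n) * (b / n)
```

If program A fails on every hundredth case and program B on every two-hundredth, every failure of B coincides with a failure of A; over 1000 cases the observed joint fraction is 0.005, a hundred times the 0.00005 predicted by multiplying the individual fractions.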

The hypothesis testing approach can be applied to the 
evaluation of the probability of joint failures of two (or 
more) systems. Both systems must be subjected to the 
same set of test conditions. Joint failures can be de- 
tected. However, because the permitted probability of 
failures for joint systems is much lower than for single 
systems, many more tests will be needed. Table II 
shows some typical values. 

In this table, we have been quite vague about the 
nature of a single test and have focused on how many 
tests are needed. Next we will discuss what constitutes 
a test and how to select one or more tests. 

TABLE II values (h = 1000000)

        N     M = (1 - 1/h)^N
  2000000     0.13534
  3000000     0.04979
  4000000     0.01832
  4100000     0.01657
  4200000     0.01500
  4300000     0.01357
  4400000     0.01228
  4500000     0.01111
  4600000     0.01005
  4700000     0.00910
  4800000     0.00823
  4900000     0.00745
  5000000     0.00674
  6000000     0.00248
  7000000     0.00091
  8000000     0.00034
  9000000     0.00012
 10000000     0.00005

Three classes of programs

The simplest class of programs to test comprises
those that terminate after each use and retain no data
from one run to the next. These memoryless batch programs
are provided with data, executed, and return an
answer that is independent of any data provided in
earlier executions.

A second class consists of batch programs that retain 
data from one run to the next. The behavior of such 
programs on the nth run can depend on data supplied 
in any previous run. 

A third class contains programs that appear to run 
continuously. Often these real-time programs are in- 
tended to emulate or replace analogue equipment. 
They consist of one or more processes; some of those 
processes run periodically, others run sporadically in 
response to external events. One cannot identify dis- 
crete runs, and the behavior at any time may depend 
on events arbitrarily far in the past. 

Reliability estimates for memoryless batch programs: For 
memoryless batch programs a test consists of a single
run using a randomly selected set of input data. If we 
are concerned with a system required to take action in 
rare circumstances, and one in which action in other 
circumstances is inconvenient rather than unsafe, the 
population of possible test cases should be restricted to 
those in which the system should take action. It is es- 
sential that one know the reliability under those cir- 
cumstances. Of course, additional tests can be con- 
ducted, using other data, to determine the probability 
of action being taken when no action is required. 
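
A minimal test driver for this case might look as follows. It assumes that the population of action-required cases and their relative frequencies are available as lists and that an oracle can judge each answer; these inputs and all names are hypothetical.

```python
import random
from typing import Callable, Sequence, TypeVar

Case = TypeVar("Case")


def run_random_tests(
    action_cases: Sequence[Case],
    weights: Sequence[float],
    program: Callable[[Case], object],
    oracle: Callable[[Case, object], bool],
    n_tests: int,
    rng: random.Random,
) -> int:
    """Draw n_tests cases with replacement from the weighted population.

    Returns the index of the first failing test, or -1 if all pass.
    """
    for i in range(n_tests):
        case = rng.choices(action_cases, weights=weights, k=1)[0]
        if not oracle(case, program(case)):
            return i  # correct the software, then restart the count from zero
    return -1
```

Only a return value of -1 counts toward the N failure-free tests of equation (1); any failure resets the count after the software is corrected.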

Reliability estimates for batch programs with memory: 
When a batch program has memory, a test consists of a 
single run. However, a test case is selected by choosing 
both input data and an internal state. For reliability 
estimates, the distribution of internal states must match 
that encountered in practice. It is often more difficult to
determine the appropriate distribution of internal states 
than to find the distribution of inputs. Determining the 
distribution of internal states requires an understanding 
of, and experience with, the program. 

An alternative to selecting internal states for the test
would be to have each test consist of a sequence of 
executions. The system must be reinitialized before 
each new sequence. Again, the distribution of these 
cases must match that found in practice if the reliabil- 
ity estimates are to be meaningful. In addition, it is 
difficult to determine the length of those sequences. 
The sequences must be longer than the longest se- 
quence that would occur in actual use. If the sequences 
are not long enough, the distribution of internal states 
that occur during the test may be badly skewed. In 
effect, this means that in actual use, the system must be 
reinitialized frequently so that an upper bound can be 
placed on the length of each test. 
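A minimal sketch of the sequence-based alternative follows. The running-average program, the reference function, and the parameter names are hypothetical; the point is only that each test is a whole sequence of executions, with the system reinitialized before each sequence.

```python
import random

class RunningAverage:
    """Hypothetical batch program with memory: state persists between runs."""

    def __init__(self):
        self.reinitialize()

    def reinitialize(self):
        self.count = 0
        self.total = 0

    def run(self, x):
        self.count += 1
        self.total += x
        return self.total / self.count

def reference_outputs(inputs):
    # Independently written expected results for a whole sequence.
    return [sum(inputs[: i + 1]) / (i + 1) for i in range(len(inputs))]

def run_sequence_tests(program, draw_input, n_sequences, seq_len, seed=0):
    """Each test is one sequence of seq_len executions; return failed sequences."""
    rng = random.Random(seed)
    failed = 0
    for _ in range(n_sequences):
        program.reinitialize()  # the system is reinitialized before each sequence
        inputs = [draw_input(rng) for _ in range(seq_len)]
        outputs = [program.run(x) for x in inputs]
        if outputs != reference_outputs(inputs):
            failed += 1         # one wrong answer fails the whole sequence
    return failed

# seq_len is assumed to bound the number of runs between reinitializations
# in actual use; integer inputs keep the comparison exact.
failed_sequences = run_sequence_tests(
    RunningAverage(), lambda rng: rng.randint(0, 100),
    n_sequences=200, seq_len=50)
```

The choice of `seq_len` carries the difficulty described above: if it is shorter than the sequences that occur between reinitializations in practice, the internal states reached during testing will not be representative.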

Reliability estimates for real-time systems: In real-time 
systems, the concept of a batch run does not apply. 
Because the real-time system is intended to simulate or 
replace an analogue system, the concept of an input 
sequence must be replaced by a multidimensional tra- 
jectory. Each such trajectory gives the input values as 
continuous functions of time. Each test involves a sim- 
ulation in which the software can sample the inputs for 
the length of that trajectory. 

The question of the length of the trajectory is critical 
in determining whether or not statistical testing is prac- 
tical. In many computer systems there are states that 
can arise only after long periods of time. Reliability 
estimates derived from tests involving short trajectories 
will not be valid for systems that have been operating 
for longer periods. On the other hand, if one selects 
lengthy trajectories, the testing time required is likely 
to be impractical. 

Statistical testing can be made practical if the system 
design is such that one can limit the length of the tra- 
jectories without invalidating the tests. To do this, one 
must partition the state. A small amount of the memory 
is reserved for data that must be retained for arbitrary 
amounts of time. The remaining data are reinitialized 
periodically. The length of the period becomes the 
length of the test trajectory. Testing can then proceed 
as if the program were a batch program with (memory- 
state, trajectory) pairs replacing input sequences. 

If the long-term memory has a small number of 
states, it is best to perform statistically significant tests 
for each of those states. If that is impractical, one must 
select the states randomly in accordance with a pre- 
dicted distribution. In many applications, the long-term 
memory corresponds to operating modes and a valid 
distribution can be determined. 
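The selection of (memory-state, trajectory) pairs might look like the sketch below. The operating modes, their probabilities, and the random-walk trajectory generator are invented for illustration; a real distribution would have to be predicted from experience with the application.

```python
import random

# Hypothetical long-term memory states (operating modes) and the predicted
# fraction of time the system spends in each of them.
MODE_DISTRIBUTION = {"startup": 0.05, "normal": 0.90, "shutdown": 0.05}

def draw_trajectory(rng, n_samples):
    """Placeholder input trajectory: a random walk sampled at n_samples instants."""
    value, samples = 0.0, []
    for _ in range(n_samples):
        value += rng.gauss(0.0, 1.0)
        samples.append(value)
    return samples

def draw_test_cases(n_cases, n_samples, seed=0):
    """Select long-term states at random according to the predicted distribution."""
    rng = random.Random(seed)
    modes = list(MODE_DISTRIBUTION)
    weights = list(MODE_DISTRIBUTION.values())
    cases = []
    for _ in range(n_cases):
        mode = rng.choices(modes, weights=weights)[0]
        cases.append((mode, draw_trajectory(rng, n_samples)))
    return cases

def per_state_test_cases(n_per_state, n_samples, seed=0):
    """With few long-term states, build a full batch of trajectories for each."""
    rng = random.Random(seed)
    return {
        mode: [draw_trajectory(rng, n_samples) for _ in range(n_per_state)]
        for mode in MODE_DISTRIBUTION
    }
```

Here `n_samples` stands for the length of the reinitialization period, which fixes the length of every test trajectory.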

Picking test cases for safety-critical real-time systems 

Particular attention must be paid to trajectory selec- 
tion if the system is required to act only in rare circum- 
stances. Since the reliability is a function of the input 
distribution, the trajectories must be selected to provide 
accurate estimates under the conditions where perfor- 
mance matters. In other words, the population from 
which trajectories are drawn must include only trajectories
in which the system must take action. Similarly,
the states of the long-term memory should be restricted
to those in which the system will be critical to safety.

Determining the population of trajectories from 
which the tests are selected can be the most difficult 
part of the process. It is important to use one's knowl- 
edge of the physical situation to define a set of trajecto- 
ries that can occur. Tests on impossible trajectories are 
not likely to lead to accurate reliability estimates. How- 
ever, there is always the danger that the model used to 
determine these trajectories overlooks the same situa- 
tion overlooked by the programmer who introduced a 
serious bug. It is important that any model used to 
eliminate impossible trajectories be developed indepen- 
dently of the program. Most safety experts would feel 
more comfortable if, in addition to the tests using tra- 
jectories considered possible, some statistical tests were 
conducted with crazy trajectories. 
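As a rough illustration of the two populations, the sketch below draws "possible" trajectories from a toy physical model and "crazy" trajectories that ignore it. The range limits and the maximum rate of change are hypothetical; a real model would be built from knowledge of the plant and, as noted above, independently of the program.

```python
import random

LOW, HIGH = 0.0, 500.0  # hypothetical physical range of the measured quantity
MAX_RATE = 2.0          # hypothetical largest change between two samples

def possible_trajectory(rng, n_samples):
    """Trajectory that respects the assumed range and rate limits."""
    value = rng.uniform(LOW, HIGH)
    samples = [value]
    for _ in range(n_samples - 1):
        step = rng.uniform(-MAX_RATE, MAX_RATE)
        value = min(HIGH, max(LOW, value + step))
        samples.append(value)
    return samples

def crazy_trajectory(rng, n_samples):
    """Trajectory that ignores the rate limit: each sample is drawn afresh."""
    return [rng.uniform(LOW, HIGH) for _ in range(n_samples)]

def is_possible(samples, tol=1e-9):
    """Check a trajectory against the model (tol absorbs rounding error)."""
    in_range = all(LOW <= s <= HIGH for s in samples)
    rate_ok = all(abs(b - a) <= MAX_RATE + tol
                  for a, b in zip(samples, samples[1:]))
    return in_range and rate_ok
```

Results from the two populations should be reported separately, since only the first is meant to support a reliability estimate.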

CONCLUSIONS 

There is no inherent reason that software cannot be 
used in certain safety-critical applications, but extreme 
discipline in design, documentation, testing, and review 
is needed. It is essential that the operating conditions 
and requirements be well understood and fully documented.
If these conditions are not met, adequate
review and testing are impossible. 

The system must be structured in accordance with 
information hiding to make it easier to understand, re- 
view, and repair. The documentation must be complete 
and precise, making use of mathematical notation 
rather than natural language. Each stage of the design 
must be reviewed by independent reviewers with the 
specialized knowledge needed at that stage. Mathemati- 
cal verification techniques must be used to make the 
review systematic and rigorous. 

An independent agency must perform statistically 
valid random testing to provide estimates of the relia- 
bility of the system in critical situations. Deep knowl- 
edge and experience with the application area will be 
needed to determine the distribution from which the 
test cases should be drawn. 

The vast literature on random testing is, for the most 
part, not relevant for safety evaluations. Because we are 
not interested in estimating the error rates or conduct- 
ing reliability growth studies, a very simple model suf- 
fices. Hypothesis testing will allow us to evaluate the 
probability that the system meets our requirements. 
Testing to estimate reliability is only practical if a real- 
time system has limited long-term memory. 

Testing to estimate trustworthiness is rarely practical 
because the number of tests required is usually quite 
large. Trustworthiness must be assured by the use 
of rigorous mathematical techniques in the review 
process. 

The safety and trustworthiness of the system will rest 
on a tripod made up of testing, mathematical review, 
and certification of personnel and process. In this arti- 
cle, we have focused on two of those legs, testing and 
review based on mathematical documentation. The 
third leg will be the most difficult to implement. While
there are authorities that certify professional engineers 
in other areas, there is no corresponding authority in 
software engineering. We have found that both classi- 
cal engineers and computer science graduates are ill- 
prepared for this type of work. In the long term, those 
who are concerned about the use of software in safety- 
critical applications will have to develop appropriate 
educational programs [15].